ABSTRACT

Pfam is a widely used database of protein families, 
currently containing more than 13000 manually 
curated protein families as of release 26.0. Pfam is 
available via servers in the UK (http://pfam.sanger.ac.uk/),
the USA (http://pfam.janelia.org/) and
Sweden (http://pfam.sbc.su.se/). Here, we report
on changes that have occurred since our 2010 
NAR paper (release 24.0). Over the last 2 years, we 
have generated 1840 new families and 
increased coverage of the UniProt Knowledgebase 
(UniProtKB) to nearly 80%. Notably, we have 
taken the step of opening up the annotation of our 
families to the Wikipedia community, by linking 
Pfam families to relevant Wikipedia pages and 
encouraging the Pfam and Wikipedia communities 
to improve and expand those pages. We continue 
to improve the Pfam website and add new visualiza- 
tions, such as the 'sunburst' representation of 
taxonomic distribution of families. In this work we 
additionally address two topics that will be of 
particular interest to the Pfam community. First, 
we explain the definition and use of family-specific, 
manually curated gathering thresholds. Second, we 
discuss some of the features of domains of 
unknown function (also known as DUFs), which con- 
stitute a rapidly growing class of families within 
Pfam. 



INTRODUCTION 

Pfam is a database of protein families, where families are 
sets of protein regions that share a significant degree of 
sequence similarity, thereby suggesting homology. 
Similarity is detected using the HMMER3
(http://hmmer.janelia.org/) suite of programs.

Pfam contains two types of families: high quality,
manually curated Pfam-A families and automatically
generated Pfam-B families. The latter are derived from
clusters produced by the ADDA algorithm (1), followed
by the subtraction of overlapping Pfam-A regions at each
release. Pfam-A families are built following what is, in
essence, a four-step process:

(i) building of a high-quality multiple sequence alignment
(the so-called seed alignment);

(ii) constructing a profile hidden Markov model (HMM)
from the seed alignment (using HMMER3);

(iii) searching the profile HMM against the UniProtKB
sequence database (2) and

(iv) choosing family-specific sequence and domain
gathering thresholds (GAs); all sequence regions
that score at or above the GAs are included in the full
alignment for the family (GAs are described in
detail in a later section of this paper).
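
As an illustration only, steps (ii) to (iv) above can be sketched as HMMER3 command lines assembled in Python. The file names and threshold values are placeholders assumed for this example and are not those of the Pfam pipeline, which is a set of Perl scripts with additional quality control; the commands are built but not executed.

```python
from typing import List


def hmmbuild_cmd(hmm_out: str, seed_alignment: str) -> List[str]:
    """Step (ii): build a profile HMM from the seed alignment."""
    return ["hmmbuild", hmm_out, seed_alignment]


def hmmsearch_cmd(hmm: str, seqdb: str, domtbl_out: str,
                  seq_ga: float, dom_ga: float) -> List[str]:
    """Steps (iii)-(iv): search a sequence database and report only hits
    at or above the sequence (-T) and domain (--domT) bit score thresholds."""
    return ["hmmsearch", "--domtblout", domtbl_out,
            "-T", str(seq_ga), "--domT", str(dom_ga), hmm, seqdb]


if __name__ == "__main__":
    # Placeholder file names and thresholds, assumed for illustration.
    print(" ".join(hmmbuild_cmd("family.hmm", "SEED.sto")))
    print(" ".join(hmmsearch_cmd("family.hmm", "uniprot.fasta",
                                 "hits.domtbl", 25.0, 25.0)))
```

When the thresholds are already stored in the HMM file, the `--cut_ga` option of hmmsearch applies them directly instead of `-T` and `--domT`.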

In addition to providing matches to UniProtKB, Pfam 
also provides matches for the NCBI non-redundant 
database, as well as a collection of metagenomic 
samples. We generate a variety of data downstream, 
including, among others, a family sequence-conservation 
logo based on the HMM, a description of domain architectures,
where all co-occurrences with other domains are
reported, and a species tree summarizing the taxonomic 
range in the family. 

The quality of the seed alignment is the crucial factor in
determining the quality of the Pfam resource, influencing
not only all data generated within the database but also
the outcome of external searches that use our profile 
HMMs, e.g. to assign domains to proteins which are 
part of newly sequenced genomes. For this reason, a con- 
siderable curatorial effort goes into seed alignment 
generation. 

Members of the same Pfam family are expected to share 
a common evolutionary history and thus at least some 
functional aspect. Ideally, our families should represent 
functional units, which, when combined in different 
ways, can generate proteins with unique functions. The 
ultimate goal of Pfam is to create a collection of function- 
ally annotated families that is as representative as possible 
of protein sequence-space, such that our families can be 
used effectively for both genome-annotation and 
small-scale protein studies. It must be stressed, however, 
that homology is no guarantee of functional similarity and 
transfer of functional annotation based solely on family 
membership should always be undertaken with caution. 
On the other hand, additional data that are available 
from Pfam, such as conservation of family signature 
residues or conservation of common domain architec- 
tures, can increase confidence in a given functional 
hypothesis. For more background on how to 
query and use our web interface please refer to Coggill 
et al. (3). 

In this paper, we report on the most recent Pfam release 
(26.0) as well as on important changes that have been 
introduced over the last 2 years, since our 2010 NAR 
database issue paper (where we presented release 24.0) 
(4). Arguably, the change carrying the most significant 
philosophical implications has been the decision to 
follow the lead of the Rfam database (5) and out-source 
functional annotation of Pfam families to Wikipedia. We 
will discuss the background to this decision and give 
details of the progress towards Wikipedia coverage of 
Pfam families. Another important development has been 
the adoption of the iterative sequence-search program 
jackhmmer (6) as our principal tool for generating new 
families. In addition, we have extended our mechanism 
for family curation, which now allows trained and 
trusted external collaborators to create and add their 
own families to Pfam. Finally, we will take this opportunity
to address and present fresh analysis on two topics that
we consider of particular importance: family-specific GAs 
and Domains of Unknown Function (DUFs). 

WHAT'S NEW 

Community annotation 

Using Wikipedia as a repository for protein family 
annotation. Historically, Pfam has provided only a basic 
level of textual annotation for each family. This has 
included a few sentences, with references, designed to
give users an overview of the function(s) of the family or
domain. However, rather than have the Pfam curators 
describe our families, we would strongly prefer to have 
annotations written by those who know the proteins and 
families best, namely the biologists and informaticians 
who work with them on a day-to-day basis. Harnessing 
the knowledge of these experts remains a significant 
challenge. 

One recent approach has been to use Wikipedia as a 
source of scientific information (7,8). Wikipedia is the 
world's largest online encyclopedia, with over 3.7 million
English language articles, and is widely acknowledged to 
be the most popular general reference work on the 
internet. A cornerstone of Wikipedia is that anyone can 
edit the content. 

The Rfam database moved all of its family annotation
into articles in Wikipedia in 2009, thereby
allowing anyone to freely edit and improve their
content. This experiment has proved successful,
engaging the wider scientific community to provide
expert annotations and improving the overall quality of
the Rfam annotation (7). In light of this positive experience,
we decided to adopt the same approach and use
Wikipedia as the primary source of Pfam annotation
(Figure 1A).

The Rfam database included around 600 families when 
the switch to Wikipedia was made. This made it feasible to 
assign existing articles where possible and to generate new 
'stub' articles in Wikipedia for any family that still lacked 
any relevant article. These stubs have since been gradually 
expanded and improved by the Rfam and Wikipedia 
communities. At the time when Pfam began using 
Wikipedia for annotations, release 25.0, there were 
12 273 Pfam-A families. We initially identified existing
articles that described protein domains or families and
which provided useful information about the Pfam
family. These articles are now assigned as the primary
annotation for the appropriate families. Given the
number of families that remain without an article, 
however, it is simply not feasible to manually gener- 
ate articles for all of them. For these families we 
continue to show the original annotation comments, 
which were written by the curator of the family, while 
encouraging our users to tell us about appropriate 
Wikipedia articles or to create them on our behalf. 
Furthermore, it is likely that there will be some families
that are not sufficiently notable for inclusion in Wikipedia 
and we anticipate that many of these will remain without 
Wikipedia annotations for the long term, perhaps 
indefinitely. 

As of release 26.0, there are 4909 Pfam families that link
to 1016 Wikipedia articles. We invite readers and users of
the Pfam website to edit and improve these articles in
Wikipedia. Mapping of Pfam-A families to Wikipedia
articles is available in JSON format, from:
http://pfamsrv.sanger.ac.uk/cgi-bin/mapping.cgi?db=pfam.
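
A family-to-article mapping of this kind can be inverted to find articles shared by several families. The sketch below assumes a simple JSON object keyed by Pfam accession with a list of article titles as values; the actual layout returned by the mapping service is not described here, and the two `PF_EXAMPLE` accessions are hypothetical placeholders.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical payload; the real JSON layout may differ.
EXAMPLE_JSON = """
{
  "PF01382": ["Avidin"],
  "PF_EXAMPLE_1": ["Zinc finger"],
  "PF_EXAMPLE_2": ["Zinc finger"]
}
"""


def families_by_article(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert accession -> titles into title -> sorted accessions."""
    by_article = defaultdict(list)
    for accession, titles in mapping.items():
        for title in titles:
            by_article[title].append(accession)
    return {title: sorted(accs) for title, accs in by_article.items()}


def shared_articles(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the articles that annotate more than one family."""
    return {t: a for t, a in families_by_article(mapping).items() if len(a) > 1}


if __name__ == "__main__":
    print(shared_articles(json.loads(EXAMPLE_JSON)))
```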

Some Wikipedia articles cover multiple Pfam families, 
such as the Zinc finger article or the Interleukin article. 
Pfam contains a large series of 3526 families noted as
DUFs. Virtually all of these DUF families link to a
single Wikipedia article, which describes DUFs in general,
and will do so until such time as their function is
determined and they have an article of their own.

Figure 1. New Pfam features since release 24.0. (A) The Pfam-A family page for Avidin (PF01382), showing the embedded contents of the associated
Wikipedia article. The 'infobox' is highlighted. (B) The 'sunburst' representation of the tree showing the species distribution of the Pfam-A family
Peptidase_M10 (PF00413). (C) The PfamAlyzer applet, showing the results of searching for all architectures that include the domains IMPDH and
CBS. The PfamAlyzer applet allows querying of Pfam for proteins with particular domains, domain combinations or architectures.

Although the process of manually creating a new 
Wikipedia article can be time-consuming and difficult, 
we are keen to increase the number of Wikipedia- 
annotated families in Pfam as much as possible. We 
have therefore developed a pipeline to generate stub
articles automatically in a Wikipedia 'sandbox', often 
taking existing family annotations from the InterPro 
database (9) as the basis for the article. These stubs can 
then be reviewed and edited by our curators, before being 
moved out of the sandbox into Wikipedia proper and used 
to annotate families. We have implemented several automated
procedures for augmenting the basic annotation
text and expanding the content as far as possible before 
its final publication in Wikipedia. 

A particularly useful feature of Wikipedia is the highlighting
of terms within an article that are themselves
described by another Wikipedia article. This network of
linked terms allows readers to quickly understand the
background to the article they are reading and, as such,
is crucial to the success of any article. To assist with
the cross-linking of our new, automatically generated
Wikipedia articles, we took the initial set of ~700 Pfam
Wikipedia articles and computationally collected a broad
set of common terms. These terms were then automatically
marked as links in the stub articles. Another essential
feature of a Wikipedia page is the reference list. We used
the Template Filler Perl module
(http://search.cpan.org/dist/WWW-Wikipedia-TemplateFiller/)
to retrieve and
include the full details of the references cited in the
InterPro annotation that we used as our starting point.
Finally, we have automatically populated the 'infobox'
(http://en.wikipedia.org/wiki/Help:Infobox) in our stub
articles. This infobox (Figure 1A), located on the
right-hand side of Wikipedia protein family pages, 
shows images of the relevant three-dimensional structures, 
where available, and additional database links. When an 
image of the protein structure was available from 
Wikimedia commons (http://commons.wikimedia.org/ 
wiki/Main_Page), this was added to the top of the 
infobox, along with a caption. Further information was 
extracted from the Pfam database and added to the 
infobox, such as the Pfam clan accession and links to 
other database sites such as PROSITE (10), SCOP (11) 
and CAZy (12). 

Altogether, the automatic article creation process 
generated 7823 articles in our Wikipedia sandbox. We 
continue to review, edit and move these stub articles 
into the main part of Wikipedia. In order to prioritize the
best articles arising from the generation process, we 
calculated for each one a heuristic score, based on the 
size of the annotation, availability or otherwise of an 
image, the number of references and the number of links 
out to other databases. The score gave an overall measure 
of the level of information in the page and thus an indi- 
cation of its potential for addition to Wikipedia. Already 
>200 of the highest scoring articles have been moved 
across to generate new Wikipedia pages. 
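
A heuristic of this kind might be sketched as a weighted sum over the four properties listed above. The weights and field names below are assumptions made purely for illustration; the actual scoring function is not specified in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubArticle:
    title: str
    annotation_chars: int   # size of the annotation text
    has_image: bool         # structure image available or not
    n_references: int       # number of literature references
    n_db_links: int         # links out to other databases


def heuristic_score(a: StubArticle, w_text: float = 0.01, w_image: float = 5.0,
                    w_ref: float = 2.0, w_link: float = 1.0) -> float:
    """Weighted sum of the four properties; all weights are hypothetical."""
    return (w_text * a.annotation_chars
            + (w_image if a.has_image else 0.0)
            + w_ref * a.n_references
            + w_link * a.n_db_links)


def prioritise(stubs: List[StubArticle], top_n: int) -> List[StubArticle]:
    """Return the top_n stubs, highest score first."""
    return sorted(stubs, key=heuristic_score, reverse=True)[:top_n]
```

Ranking the sandbox articles with `prioritise` and reviewing from the top would correspond to moving the highest scoring stubs into Wikipedia first.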

One of the major concerns about Wikipedia generally is
the risk of vandalism and deliberate errors being
introduced into publicly edited articles. This is of particular
concern to both the Pfam and Rfam projects, since we
'scrape' and re-display Wikipedia contents within our respective
websites. In order to reduce the likelihood of
blatant vandalism or egregious errors propagating
through to our websites, we include an additional
approval process before displaying newly edited
Wikipedia articles. Our curators review and, if necessary,
revert changes to articles on a daily basis, and only after an
article has been reviewed is it flagged for update and presentation
within the Pfam website. In our experience,
almost every case of vandalism is reverted by the
Wikipedia community before we come to review the
changes. Overall we have found that ~1% of all edits
are reverted by the Wikipedia community, suggesting an
upper bound on the possible number of vandalism edits.

It is important to stress that the Wikipedia content dis- 
played in Pfam family pages is an exact copy of the article 
that can be found on the main Wikipedia website, subject 
to a delay of a day or so for the approval process 
described above. 

Family function annotation via the Pfam helpdesk. Val 
Wood of the Schizosaccharomyces pombe database, 
PomBase, routinely reports new findings from the literature
to the Pfam helpdesk (pfam-help@sanger.ac.uk).
Over the last 12 months, 74 such communications have 
been received, of which at least four provided evidence 
for the function of a DUF. A further 16 concerned hits 
of newly characterized S. pombe sequences to Pfam-B 
families, thus leading to the building of at least that 
number of new families. 

One good example of a family that has been
characterized in this way is DUF1709, in which a fission
yeast anillin protein was characterized. Anillin proteins
are actin-binding proteins involved in septin-organization,
which are localized to the cleavage-furrow during cell
division (13). The DUF has been re-named Anillin
(PF08174).

Similarly, Pfam-B family PB008473 from Pfam release 
24.0 was found to contain a fission yeast protein, Mtr4 
(UniProtKB; Q9P795), which had been determined ex- 
perimentally to be an essential RNA helicase that 
performs a critical role as an activator of the nuclear 
exosome in RNA processing and degradation. From this 
finding (14), family rRNA_proc-arch (PF13234) was built 
and described. 

Other contributions from the community over the last
year have included 51 direct annotation submissions
(received via a web form available from the Pfam family
pages) with suggestions for improvements and updates to
the Pfam annotations; of these, 14 offered information
about the function of DUFs, 4 of these coming from the
InterPro team. A good example of how the functionality
of InterPro benefits Pfam was a case where a team
member flagged up one of our DUFs, DUF3462
(PF11945), as being the WASH subunit of the WASH
complex (Wiskott-Aldrich Syndrome Protein and SCAR
Homolog) that acts as an Arp2/3 activator necessary for
Golgi-directed trafficking (15). The DUF was re-named as
the WAHD domain of WASH complex.

Examples of cases where the determination of the 
three-dimensional structure of sequences from bacterial 
DUFs has led to the discovery of function are detailed 
in a dedicated issue of Acta Crystallographica, Section
F (16).

Extending the community of Pfam curators. Although
Wikipedia offers an established mechanism for external
scientists to contribute annotations,
it is restricted purely to functional annotation. As
outlined above, the helpdesk provides a way of making 
more substantial contributions to the database, but the 
fraction of new families derived from helpdesk submis- 
sions is relatively small. Furthermore, contributions 
from the helpdesk are often in a different format and/or 
use a different sequence database to that used by Pfam. 
More often than not, a Pfam curator has to spend time 
understanding and modifying the submission to conform 
to the Pfam data model, which makes the helpdesk a far 
from ideal interface for bulk submissions. 

Pfam is run by an international consortium of three 
groups, but until very recently our fundamental family 
data could be modified only at our Cambridge, UK, 
site. This has meant that even full consortium members 
have been unable to add their own families to Pfam. In 
order to remove this restriction, and with the goal of 
making it easier for members of the wider community to 
add families, we have developed a system that allows Pfam 
families to be added by registered users anywhere in the 
world. The distributed system involves the local installation
of our family building pipeline (a set of Perl scripts
and modules) and various quality control procedures. It
allows the addition of new families and clans, as well as
the modification of existing entries. Data are sent back- 
wards and forwards between the user and a central, master 
server using HTTPS, and we are able to authenticate all 
traffic that results in changes to the database. Files are 
maintained using the widely used Subversion revision 
control system (http://subversion.apache.org/), thereby 
preventing the inadvertent conflicts that could occur 
when multiple users wish to make changes in a distributed 
environment. 

Owing to the organizational changes detailed above, we 
have been able to embrace two external groups who work 
with data relevant to Pfam, giving them direct access to 
the Pfam submission pipeline. The Protein and Genome
Evolution Research Group, run by L. Aravind at the 
National Center for Biotechnology Information (NCBI, 
USA), are experts in protein evolution, routinely 
publishing articles on large evolutionarily related
superfamilies. The second external contributing group is
that of Adam Godzik at the Burnham Institute (USA),
part of the Joint Center for Structural Genomics (JCSG).

Members of both of these groups are now able to 
submit families and clans directly to Pfam, allowing
them to improve Pfam data and to extend its reach to 
new communities and a broader audience. We see the 
introduction of this distributed curation model, in com- 
bination with the use of Wikipedia as a source of our 
annotations, as two important steps in making our 
database a community-based resource. Our goal is to 
provide an infrastructure that empowers scientists to con- 
tribute using whichever mechanism they feel most com- 
fortable, while still allowing us to maintain oversight 
and control of the quality of our fundamental data.

Generating new families using jackhmmer iterative 
searches 

The HMMER3 package (http://hmmer.janelia.org/)
includes the jackhmmer (6) program for running iterative,
profile HMM-based searches against a sequence database
(PSI-BLAST-like) starting with a single sequence. This, in
parallel with curation of Pfam-B alignments, has become
our main protocol for generating families. Sequences of
interest are used as queries for a 3-iteration jackhmmer
search. Seed alignments for new Pfam families are
produced from the resulting jackhmmer multiple
sequence alignment. In particular, we have applied this
protocol for family mining in a set of complete proteomes 
drawn from a wide taxonomic range (>50 proteomes 
overall). For a given proteome, every sequence lacking a 
match to a Pfam entry was used to initiate a jackhmmer 
search. 
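
The proteome-mining protocol can be outlined as follows: select the sequences with no Pfam match, then run a 3-iteration jackhmmer search for each. This is a minimal sketch under stated assumptions: the identifiers and file names are placeholders, and the command is assembled but not run. The `-N` (maximum iterations) and `-A` (save the alignment of hits) options are standard jackhmmer options.

```python
from typing import Iterable, List, Set


def unmatched_sequences(proteome_ids: Iterable[str],
                        pfam_matched_ids: Set[str]) -> List[str]:
    """Sequences in a proteome that lack any match to a Pfam entry."""
    return [seq_id for seq_id in proteome_ids if seq_id not in pfam_matched_ids]


def jackhmmer_cmd(query_fasta: str, seqdb: str, alignment_out: str,
                  iterations: int = 3) -> List[str]:
    """Command line for an iterative search seeded with a single sequence."""
    return ["jackhmmer", "-N", str(iterations), "-A", alignment_out,
            query_fasta, seqdb]


if __name__ == "__main__":
    # Placeholder identifiers and file names, assumed for illustration.
    for seq_id in unmatched_sequences(["P1", "P2", "P3"], {"P2"}):
        print(" ".join(jackhmmer_cmd(f"{seq_id}.fa", "uniprot.fasta",
                                     f"{seq_id}.sto")))
```

The alignment written by `-A` would then serve as the starting point from which a curator derives a seed alignment.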

Website changes 

Sunburst representation of the taxonomic tree. For each 
Pfam-A family we provide an interactive taxonomic tree, 
showing the species distribution of sequences in the 
family. However, due to the size of many families, this 
tree can be very large, making it difficult to gain a clear 
impression of the species distribution of the family. In 
order to address this problem, we have introduced a 
'sunburst' representation of the species trees, as shown 
in Figure 1B. Sunbursts are a commonly used method of
visualizing tree-like data sets, whereby the root of a tree is 
plotted as a circle, surrounded by concentric rings repre- 
senting child nodes. In Pfam, each node of the taxonomic 
tree is drawn as an arc, whose distance from the centre 
corresponds to the taxonomic level of the node and whose 
length (or, equivalently, the angle subtended by the arc) is 
scaled to represent either the number of sequences or the 
number of species belonging to that node in the tree. The
scaling, by number of sequences or by number of species,
may be switched interactively using a control in
the page. Arcs are coloured according to kingdom. As the
mouse pointer is moved across the sunburst, a tool-tip 
shows a summary of the current node, giving the species 
name for that node, along with the number of species and 
number of sequences beneath it. A summary panel also
shows a simple graphical representation of the lineage of
the relevant node. The overall size of the plot may be
adjusted using a simple slider.
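
The geometry of such a plot can be illustrated with a short sketch in which each node receives a ring index (its depth in the tree) and an angular span proportional to the count beneath it. The `Node` class and `layout` function are hypothetical names introduced for this example and do not reproduce the website code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    count: int = 0  # sequences (or species) at a leaf; ignored for internal nodes
    children: List["Node"] = field(default_factory=list)


def total(node: Node) -> int:
    """Count at a leaf, or the sum over all descendants of an internal node."""
    if not node.children:
        return node.count
    return sum(total(child) for child in node.children)


def layout(node: Node, start: float = 0.0, end: float = 360.0, depth: int = 0,
           out: Optional[List[Tuple[str, int, float, float]]] = None
           ) -> List[Tuple[str, int, float, float]]:
    """Return (name, ring, start_angle, end_angle) for every node.
    Children divide the parent's span in proportion to their totals."""
    if out is None:
        out = []
    out.append((node.name, depth, start, end))
    node_total = total(node)
    angle = start
    for child in node.children:
        span = (end - start) * total(child) / node_total if node_total else 0.0
        layout(child, angle, angle + span, depth + 1, out)
        angle += span
    return out
```

Switching between scaling by sequences and by species amounts to recomputing the layout with a different `count` on each leaf.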

The sunburst tree is generated by mapping the 
UniProtKB assigned NCBI taxonomy identifiers onto 
the standard NCBI taxonomy. Unfortunately, there is 
not a perfect equivalence between taxonomy trees used 
by UniProtKB and NCBI, due simply to the fluid nature 
of the data and the different update cycles of the two re- 
sources. This mis-match inevitably generates cases where 
the mapping between the taxonomy identifiers in 
UniProtKB and NCBI breaks down. Species that cannot 
be assigned an exact node in the NCBI tree are shown as 
'Unclassified' in the sunburst. Furthermore, because the 
NCBI taxonomy contains numerous levels that are not 
present across all species, we have attempted to normalize 
taxonomic levels to the eight major ones (domain, 
kingdom, phylum, class, order, family, genus, species). For
example, the lineage of Bos taurus contains the sub-family
level Bovinae, which we skip, connecting the genus
directly to the family level, Bovidae. Some lineages also
omit one or more of the major levels. Again, in the case of
B. taurus, the level 'order' is omitted and the missing level 
is flagged with 'No order'. We perform a node merger in 
the case of sub-species so that, for example, all sub-species 
of Escherichia coli are merged up to the species level and 
presented as E. coli sequences. These normalization steps 
allow us to draw every species with the same eight levels, 
making the outer ring of the sunburst complete and 
allowing the plot to represent more intuitively the distri- 
bution of sequences at each level. 
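
The normalization can be sketched as a projection of each lineage onto the eight major levels, dropping intermediate ranks and inserting a placeholder for any missing level. The function name and the (rank, name) input format are assumptions for this example; ranks below species are simply discarded, which has the effect of merging sub-species into their species.

```python
from typing import Dict, List, Tuple

MAJOR_RANKS = ["domain", "kingdom", "phylum", "class",
               "order", "family", "genus", "species"]


def normalise_lineage(lineage: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Project a list of (rank, name) pairs onto the eight major ranks.
    Other ranks (e.g. sub-family, sub-species) are skipped, and a missing
    major rank is flagged as 'No <rank>'."""
    by_rank: Dict[str, str] = {}
    for rank, name in lineage:
        if rank in MAJOR_RANKS and rank not in by_rank:
            by_rank[rank] = name
    return [(rank, by_rank.get(rank, f"No {rank}")) for rank in MAJOR_RANKS]


if __name__ == "__main__":
    # Abbreviated lineage for illustration; the upper levels are omitted here.
    bos_taurus = [("family", "Bovidae"), ("subfamily", "Bovinae"),
                  ("genus", "Bos"), ("species", "Bos taurus")]
    for rank, name in normalise_lineage(bos_taurus):
        print(f"{rank:8s} {name}")
```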

Reinstatement of the PfamAlyzer tool for complex
architecture queries. PfamAlyzer (17) is a Java applet
that provides a user-friendly graphical interface to Pfam
(Figure 1C). It was available in a previous version of the
Pfam website (18) but was removed during development
of the new website. It has now been reinstated and can be
accessed through the search page. PfamAlyzer enables
complex domain architecture queries to be specified
using a simple drag-and-drop interface. The user can
select a set of domains from drop-down lists of Pfam-A
families or Pfam clans and drag and arrange them to build
a query architecture. PfamAlyzer use has been described 
in detail elsewhere (17,18). 

PFAM STATISTICS 

In our last NAR database paper (4), we reported on stat- 
istics from Pfam release 24.0. Here, we compare those 
numbers to our latest release, 26.0. 

General 

Pfam 26.0 comprises 13 672 Pfam-A families, an increase
of ~15% with respect to Pfam 24.0. The total number of
clans is now 499, up 18% since Pfam 24.0. Of the added
families, 40% belong to clans. This brings the proportion
of families in clans to 31%, compared with 26%
in release 24.0. Added families that are not in a clan are, on
average, much smaller than those in release 24.0
(average size of a non-clan family in release 24.0 was 832
members, compared to 337 for those added after release
24.0). Finally, 34% of all new families are DUFs (8% of
these belong to clans), bringing the total number of DUFs
in release 26.0 to 3526.

Table 1. UniProtKB and UniRef50 coverage comparison between
Pfam release 24.0 and 26.0

                                    Pfam release 24.0   Pfam release 26.0

UniProtKB sequence coverage (%)     75.1                79.4
UniProtKB amino acid coverage (%)   53.2                57.1
UniRef50 sequence coverage (%)      58.2                57.7
UniRef50 amino acid coverage (%)    36.9                36.6

Release 24.0 coverage is calculated on UniProtKB 15.6 (August 2009)
version and corresponding UniRef50; release 26.0 coverage is calculated
on UniProtKB 2011_06 version and corresponding UniRef50.

UniProtKB coverage 

Pfam uses UniProtKB as its reference sequence database. 
Between Pfam releases 24.0 and 26.0, UniProtKB has 
increased in size by 69% (9.4 million sequences in 
UniProtKB in August 2009 versus 15.9 million sequences 
in June 2011). Pfam seems to have coped well with the 
increase in number of sequences (Table 1), with 
UniProtKB sequence coverage up >4% since release 
24.0. Amino acid coverage has followed a very similar 
trend. In addition, the coverage of the redundancy- 
reduced sequence dataset UniRef50 (19) (redundancy 
reduced version of a dataset including UniProt and a 
number of other additional sequences from UniParc), 
decreased only slightly between releases 24.0 and 26.0. It 
is important to note, however, that coverage of UniRef50
is ~20 percentage points lower than coverage of the
non-redundancy-reduced UniProtKB database. These data
indicate that Pfam has good coverage of the large, densely
populated regions of protein space. The numerous less 
well-populated regions represent a significant challenge 
to all protein family databases, if the whole of protein 
space is ever to be completely represented by such 
databases. 
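
The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the UniProtKB sizes and sequence coverage values quoted above; the helper names are introduced for illustration.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


def covered_millions(db_size_millions: float, coverage_percent: float) -> float:
    """Approximate number of sequences (in millions) with a Pfam-A match."""
    return db_size_millions * coverage_percent / 100.0


if __name__ == "__main__":
    # UniProtKB grew from 9.4 to 15.9 million sequences.
    print(f"UniProtKB growth: {percent_increase(9.4, 15.9):.0f}%")
    # Sequence coverage rose from 75.1% to 79.4%.
    print(f"Coverage gain: {79.4 - 75.1:.1f} percentage points")
    print(f"Covered, release 24.0: {covered_millions(9.4, 75.1):.2f} million")
    print(f"Covered, release 26.0: {covered_millions(15.9, 79.4):.2f} million")
```

On these numbers, the count of UniProtKB sequences with a Pfam-A match rose from roughly 7.1 million to roughly 12.6 million while percentage coverage also increased.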

Coverage of complete proteomes 

An alternative way to measure Pfam growth is to assess 
the sequence and amino acid coverage of 'complete' 
genomes. Completed genomes provide relatively stable 
protein data sets, making it easier to assess changes in 
growth from release to release. Proteome sets are derived 
from the list of proteome FASTA files provided by 
Ensembl Genomes (20). Table 2 lists the Pfam 26.0
coverage of proteomes from a diverse set of organisms. 
The list is the same as that reported in 2010 (4), except 
that Bacillus subtilis has also been included. Generally, 
there has been an increase of 2-4 percentage points in
both amino acid and sequence coverage since release 
24.0. However, this trend is not observed in the large eu- 
karyotic genomes Homo sapiens, Gallus gallus, Mus 
musculus and Danio rerio, where sequence and/or amino
acid coverage has remained similar or has even become
lower compared to that reported previously. This obser- 
vation is explained by the fact that these four proteomes 
have substantially increased in size (number of proteins) 
over the past 2 years (increasing 25-71%), due to better 
integration of Ensembl data into UniProt. This has 
allowed us to improve cross-referencing between Pfam 
26.0 and Ensembl Genomes. We will use such coverage 
analysis to drive the selection of proteomes for family 
mining using jackhmmer, as described previously. 

Website usage 

The Pfam website continues to be widely used, both in 
terms of the geographic spread of users (see Figure 2) 
and in terms of the breadth of information retrieved 
from it. The various sequence search tools provided in 
the website are also heavily used. Taking as an example 
the period from 1 to 30 June 2011, a broadly representa- 
tive month in terms of overall Pfam usage, we performed a
total of 97 853 sequence searches across the three mirror 
sites. Of these, 93 871 were single-sequence searches for 
Pfam-A matches, while 3472 were single-sequence 
searches for Pfam-B matches. We also ran a total of 510 
offline multiple-sequence searches, which were submitted 
by >100 different users; offline search results are emailed 
back to the user once the search completes. 

A MORE DETAILED VIEW OF GATHERING
THRESHOLDS AND DUFs

In this section, we discuss two topics. First, we address 
issues concerning statistical significance levels for inclu- 
sion of sequences into Pfam families; we regularly 
receive questions on this subject, which may indicate 
that the meaning of our family-specific gathering
thresholds/cutoffs is not widely understood. Second, we
continue (16) our analysis of DUFs, pointing to cases 
that may be of more interest for experimental functional 
characterization. 

Pfam gathering thresholds 

What are sequence and domain gathering thresholds? The 
gathering thresholds, or GAs, are manually curated, 
family-specific, bit score thresholds that are chosen by 
Pfam curators at the time a family is built. Every family 
is given two GAs, a 'sequence' threshold, and a 'domain' 
threshold. In HMMER, the sequence bit score is the sum 
of all scoring matches between the sequence and the 
profile HMM. The domain bit score is the score 
assigned to each reported match between the sequence 
and the profile HMM. For a protein region to be con- 
sidered as part of a family, both its sequence and 
domain bit scores must be equal to, or greater than, the 
corresponding GA. In families that contain sequences 
with multiple matches to the profile HMM, domain 
thresholds can be set to a value lower than sequence 
thresholds, in order to increase sensitivity. This is based 
on the assumption that finding multiple copies of a 
domain on the same sequence increases the chance of 
those instances being genuine matches, even when their 
E-values are not very significant when taken in isolation. 
This is particularly true for Pfam families assigned the 
type 'repeat', where instances of the repeating unit 
within a sequence diverge substantially from the consensus. 
In practice, only 2.3% of all families have different 
sequence and domain GAs. 

Table 2. Residue and sequence coverage of a number of complete proteomes in Pfam 26.0 

Species                                           Sequence coverage (%)   Amino acid coverage (%) 

Archaea 
Methanococcus vannielii (strain SB/ATCC 35089)    86.7                    65.9 
Methanosphaera stadtmanae (strain DSM 3091)       80.8                    56.3 
Thermofilum pendens (strain Hrk 5)                73.1                    54.7 

Bacteria 
Bacillus subtilis                                 85.7                    67.8 
Escherichia coli (strain MG1655)                  93.4                    71.8 
Helicobacter pylori (strain HPAG1)                80.4                    60.3 
Pseudomonas aeruginosa (strain UCBPP-PA14)        87.6                    66.2 
Salmonella typhi (strain CT18)                    85.7                    68.3 
Staphylococcus aureus (strain MW2)                85.5                    68.6 
Streptococcus pyogenes (strain MGAS9429)          81.9                    67.4 
Thermus thermophilus (strain HB8)                 84.0                    64.5 
Yersinia pestis (strain Pestoides F)              89.0                    67.7 

Eukaryota 
Anopheles gambiae                                 77.9                    42.8 
Arabidopsis thaliana                              75.3                    43.9 
Caenorhabditis elegans                            67.2                    40.2 
Danio rerio                                       85.1                    45.8 
Dictyostelium discoideum (strain AX4)             60.4                    28.7 
Drosophila melanogaster (strain Berkeley)         74.3                    38.0 
Gallus gallus                                     79.7                    45.6 
Homo sapiens                                      69.7                    44.4 
Leishmania braziliensis                           55.2                    21.7 
Mus musculus                                      73.9                    44.1 
Paramecium tetraurelia                            54.6                    24.8 
Saccharomyces cerevisiae (strain ATCC 204508)     81.6                    44.2 
Schizosaccharomyces pombe (strain ATCC 38366)     88.3                    48.5 
Toxoplasma gondii (strain RH)                     57.3                    19.1 
Tetraodon nigroviridis                            70.2                    42.0 
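
The rule that a region must meet both the sequence GA and the domain GA can be illustrated with a short sketch. It is not Pfam's own code: the `DomainHit` record and the function names are hypothetical, and the scores are assumed to have been parsed already from a HMMER search. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One reported match between a sequence and a profile HMM (hypothetical record)."""
    seq_id: str
    start: int
    end: int
    sequence_bits: float  # bit score of the whole sequence against the model
    domain_bits: float    # bit score of this individual match


def passes_ga(hit, sequence_ga, domain_ga):
    """A region belongs to the family only if BOTH scores reach their GA."""
    return hit.sequence_bits >= sequence_ga and hit.domain_bits >= domain_ga


def filter_hits(hits, sequence_ga, domain_ga):
    """Keep the hits that satisfy the sequence and domain gathering thresholds."""
    return [h for h in hits if passes_ga(h, sequence_ga, domain_ga)]
```

Setting `domain_ga` below `sequence_ga`, as described for repeat-type families, lets weaker individual matches through provided that the sequence as a whole scores well. 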

Criteria for gathering threshold assignment. Family GAs 
are chosen with the goal of maximizing coverage while 
excluding any false positive matches. Although the 
number of false positives for a given threshold is generally 
unknown, one way to monitor the false-positive rate 
indirectly is to check for overlaps between one Pfam family 
and another. If the same region of a sequence matches two 
Pfam families, it should be considered a false positive in 
one of them. This holds true unless the two families are 
found to be in the same clan, i.e. the observed overlap is 
believed to reflect an evolutionary relationship between 
them. 
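
The overlap check can be sketched as follows. This is a minimal illustration under stated assumptions, not Pfam's implementation: the `Region` record, the `clan_of` mapping and the rule that any shared residue counts as an overlap are all hypothetical simplifications. 

```python
from itertools import combinations
from typing import NamedTuple


class Region(NamedTuple):
    """A family match on a sequence (hypothetical record, inclusive coordinates)."""
    seq_id: str
    family: str
    start: int
    end: int


def regions_overlap(a, b):
    """True if the two regions share at least one residue on the same sequence."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


def find_conflicting_overlaps(regions, clan_of):
    """Return overlapping region pairs from different families not in the same clan.

    clan_of maps a family name to its clan; families without a clan are absent.
    """
    conflicts = []
    for a, b in combinations(regions, 2):
        if a.family == b.family or not regions_overlap(a, b):
            continue
        clan_a = clan_of.get(a.family)
        clan_b = clan_of.get(b.family)
        if clan_a is not None and clan_a == clan_b:
            continue  # overlap tolerated: believed to reflect common ancestry
        conflicts.append((a, b))
    return conflicts
```

Each pair returned would flag a likely false positive in one of the two families, which a curator could resolve by raising a GA or by placing the families in the same clan. 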

When building a new family, therefore, the GA choice is 
often influenced by overlaps with other families. In 
general, overlap resolution between old and new families 
leads to GAs being raised over time, since one way to 
resolve the overlap is to raise the GA in one or the 
other of the families. This means, for example, that 
when the UniProtKB dataset underlying Pfam is 
updated, i.e. at every new release, numerous GAs need 
to be modified. This is because new sequences will have 
introduced many new overlaps and, as stated above, Pfam 
does not allow overlaps between families that are not in 
the same clan. In a few cases, these sequences will indeed 
be judged 'transitional' between two families and the 
families will be added to the same clan. In Figure 3, we 
compare the values of sequence GAs for families in 
Pfam release 24.0 with GAs of the same families in 
release 26.0. Overall, 13% of GAs have changed, of 
which 91% have been raised. 
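
A comparison of this kind can be expressed in a few lines. The sketch below assumes two hypothetical dictionaries that map family accessions to sequence GAs in the older and newer release, and it considers only families present in both. 

```python
def summarize_ga_changes(old_ga, new_ga):
    """Summarize how sequence GAs changed between two releases.

    old_ga and new_ga map family accession -> sequence GA (bits).
    Only families present in both releases are compared.
    """
    shared = old_ga.keys() & new_ga.keys()
    changed = [fam for fam in shared if new_ga[fam] != old_ga[fam]]
    raised = [fam for fam in changed if new_ga[fam] > old_ga[fam]]
    return {
        "n_shared": len(shared),
        "fraction_changed": len(changed) / len(shared) if shared else 0.0,
        "fraction_raised_among_changed": len(raised) / len(changed) if changed else 0.0,
    }
```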

Figure 3. Heat map showing sequence gathering threshold (GA) 
changes between Pfam releases 24.0 and 26.0. Yellow squares represent 
high density; red squares represent low density. Squares on the diagonal 
correspond to GAs that are unchanged; squares in the region above the 
diagonal are GAs that have increased; and squares below the diagonal 
are GAs that have decreased. For the sake of clarity, we chose to show 
a zoomed-in version of the complete plot, which also includes a number 
of points outside of the range seen here. The plot was created using R 
(21). 

Distribution of gathering thresholds of Pfam families and 
their relationship to E-values. The distributions of Pfam 
family GAs and corresponding E-values are shown in 
Figures 4A and B, respectively. The two GA peaks 
observed for intervals 25.0-26.0 and 27.0-28.0 
(Figure 4A) are due to the fact that numerous Pfam 
GAs (~27%) are set to fixed integer values of either 25.0 
or 27.0. This is also the cause of the bimodal E-value 
distribution seen in Figure 4B. Historically, a large number 
of Pfam families were assigned a reference GA of 25.0. 
More recently, we have used a higher (guidance) reference 
threshold of 27.0. These values correspond roughly to 
'safe' E-value thresholds of ~10~^ and their increase 
(from GA 25.0 to GA 27.0) reflects the increase in the 
size of the UniProtKB database (any particular bit score 
value will become less significant as the database size 
increases). In the absence of overlaps with other families, 
these thresholds are often left unchanged. In retrospect, 
however, these choices look too conservative, since most 
families that do not have thresholds of 25.0 or 27.0 have a 
distribution of GAs that is strongly shifted toward lower 
bit score values (median = 21.2, 25th percentile = 20.6, 
75th percentile = 22.8) (Figure 4C, right side). This seems 
to indicate that most reference thresholds could be 
lowered, thereby increasing coverage. 
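
A rough numerical sketch shows why a fixed bit score becomes less significant as the database grows. It assumes that the high-scoring tail of the score distribution is approximately exponential with slope ln 2, which is how HMMER3 score statistics are commonly described. The offset `tau` is a model-specific location parameter, set to zero here purely for illustration, so the absolute E-values produced are not meaningful; only the scaling with database size is. 

```python
import math

LAMBDA = math.log(2)  # assumed slope of the exponential score tail


def approx_evalue(bits, db_size, tau=0.0):
    """Approximate E-value: database size times the tail probability of the score."""
    return db_size * math.exp(-LAMBDA * (bits - tau))


def bits_for_evalue(target_evalue, db_size, tau=0.0):
    """Bit score needed to reach a target E-value in a database of a given size."""
    return tau + math.log(db_size / target_evalue) / LAMBDA
```

Under this approximation each doubling of the database costs one bit, so a two-bit increase in the reference threshold (from 25.0 to 27.0) would offset a fourfold growth in the number of sequences searched. 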

Figure 4D reports the distribution of E-values that 
correspond to family GAs for all families (left side) or 
excluding those families with GA 25.0 or 27.0 (right 
side). In the latter case the distribution has a median of 
0.18, a 25th percentile of 0.057 and a 75th percentile of 
0.27. A handful of families have E-values that, at first 
sight, appear to be either too high or too low. In 
particular, there are 72 families with E-value >1, and 82 with 
E-value <10~^. A survey of these families indicates the 
following. High E-values arise for two possible reasons. 
Firstly, the E-value may have been set high because the 
model is very short and a more 'realistic' E-value would 
result in no matching sequences being reported. 
Alternatively, the high E-value may have been chosen 
because that was the relevant value for the size of the 
sequence database when the model was first built and 
this family has not been revised subsequently because no 
overlaps have been introduced. Low E-values, which often 
correspond to very long profile HMMs, are likely to have 
been set low in order to avoid inclusion of sequences from 
other families (overlaps). These overlaps frequently 
originate from the biased distribution of amino acids in these 
particular profile HMMs, such that the profile HMMs are 
too generic and capture equally biased but unrelated 
sequences (one example is coiled-coil families). We will 
revisit some of the families that may need re-thresholding 
in the near future, as part of a larger scale analysis of 
manually set GAs, and their discriminatory power 
versus using a uniform fixed threshold. 

DUFs and Uncharacterized Protein Families 

DUFs and uncharacterized protein families (UPFs) 
(hereafter simply referred to as 'DUFs') are families that lack 
any functional annotation in Pfam. They currently 
constitute more than a quarter of all Pfam families and their 
number has been steadily increasing over the last few years 
(Figure 5A, blue line, and Bateman et al. (16)). As 
previously reported (16), DUFs are, on average, less widely 
distributed on the evolutionary tree than functionally 
annotated families. For this reason, despite representing 
26.5% of all families, they account for only 6.7% of Pfam 
26.0 sequence coverage of UniProtKB. 

Normally, when function for at least one protein in a 
DUF has been experimentally determined, the family is 
renamed. Although the number of functionally 
characterized DUFs that have been renamed is on 
the rise (Figure 5A, red line), this rise is easily outpaced 
by the number of DUFs that are being newly 
generated (Figure 5A, blue line). To compound the 
problem, we have struggled to keep up with published 
functional studies for such a large number of DUFs. 
We are therefore in the process of reviewing the scientific 
literature, with the aim of improving our annotation of 
DUFs. 

Pfam also contains numerous domains, e.g. YbbR 
(PF07949) or YfcL (PF08891), that have been named 
after representative proteins from bacteria such as 
E. coli and B. subtilis, but whose function remains 
uncharacterized. We would like to make Pfam users 
(especially those using Pfam for large scale studies) aware of 
the fact that these families with an assigned name, i.e. a 
name other than DUF/UPF, still remain in the category of 
domain of unknown function. We estimate the number of 
uncharacterized Pfam families that are not named DUF to 
be around 700, although this figure is likely to be an 
underestimate. 

DUFs include numerous families that are potentially of 
great interest for experimental characterization of 
function. Among these are ~300 DUFs that are found 
in >100 representative genome clusters (using the set of 
clusters with 35% cut-off from PIR (23)) (Figure 5B). The 
wide taxonomic distribution of these DUFs suggests that 
they are likely to be associated with important functions in 
the cell. Furthermore, the Pfam DUF list includes >400 
families with at least one human protein. 

Two interesting examples of former DUF families that 
have been characterized in recent years are DUF26 and 
DUF1017. DUF26 (PF01657), now annotated as salt 
stress response/antifungal family, has been found as a 
duplicated domain in the Oryza sativa root meander 
curling (OsRMC) protein, where it plays a role in salt 
stress response (24). It is also found in ginkbilobin-2 
from Ginkgo biloba, which possesses anti-fungal activity 
(25). The crystal structure of ginkbilobin-2 has been 
determined (26) and, as a result of this, we have been 
able to extend the boundaries of PF01657 to encompass 
the entire domain. In the second example, E. coli GfcC 
belongs to DUF1017 (PF06251). This protein is known to 
play a role in group 4 capsule (G4C) polysaccharide 
biosynthesis (27), so the family has been re-annotated as 
capsule biosynthesis GfcC. The crystal structure of GfcC 
has recently been published (28) and, based on this 
structure, we have again extended the boundaries of this 
family. 

Figure 4. Distribution of sequence gathering (GA) thresholds and of corresponding E-values. (A) Distribution of sequence GAs for all Pfam-A 
families. Note that intervals are such that, for example, '25-26' translates into 25 ≤ sequence GA (bits) < 26. (B) Same as the histogram in panel (A), 

that a PDB structure is available for a member of the same clan. 

Part of our motivation for creating DUF families is that 
we hope to provide information that can guide or 
accelerate functional characterization of these domains. Data 
that may be retrieved from the Pfam website for such 
families include alignments that pinpoint sequence 
conservation, species distribution and domain 
co-occurrence. These can help elucidate the evolutionary 
origin of the family and, in some cases, reduce the number 
of functional hypotheses. Information about 
co-occurrence with annotated domains, for example, is 
of value because it points to functional processes in 
which DUF families may be involved. In our latest 
release (26.0), we find that 23% of all DUFs co-occur 
with at least one annotated domain and that 76 of them 
are found in a single architecture in combination with at 
least one annotated domain (note: we only consider 
architectures with at least five members) (Figure 5C). In the 
case of DUFs for which the structure of one or more 
members is available, structural information can be 
effectively combined with sequence conservation, for example, 
to highlight putative binding sites for small ligands, 
proteins or DNA. Some 26% of DUFs have at least one 
structurally determined protein within the family or within 
the clan to which the family belongs [Figure 5D; see also 
Jaroszewski et al. (29)]. Taken alone, this information is 
unlikely to be enough to confidently assign function to a 
family, but it can be sufficient to identify interesting 
targets for experimental characterization. 
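
The co-occurrence analysis can be approximated with the sketch below. It rests on assumptions that are not taken from the text: architectures are given as a hypothetical mapping from an ordered tuple of family names to the number of member sequences, every family outside the `dufs` set is treated as annotated, and 'a single architecture' is read as a DUF that occurs in exactly one qualifying architecture. 

```python
from collections import defaultdict


def duf_cooccurrence(architectures, dufs, min_members=5):
    """Find DUFs that co-occur with annotated domains.

    architectures: dict mapping a tuple of family names -> number of sequences.
    dufs: set of family names regarded as DUFs; all others count as annotated.
    Returns (cooccurring, single_arch), two sets of DUF names.
    """
    all_archs = defaultdict(set)        # DUF -> every qualifying architecture
    annotated_archs = defaultdict(set)  # DUF -> architectures with an annotated domain
    for arch, n_members in architectures.items():
        if n_members < min_members:
            continue  # ignore sparsely populated architectures
        families = set(arch)
        has_annotated = any(fam not in dufs for fam in families)
        for duf in families & dufs:
            all_archs[duf].add(arch)
            if has_annotated:
                annotated_archs[duf].add(arch)
    cooccurring = set(annotated_archs)
    single_arch = {duf for duf in cooccurring if len(all_archs[duf]) == 1}
    return cooccurring, single_arch
```

The DUFs in `single_arch` are the simplest cases to interpret, because the annotated partner domain is always the same. 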



CONCLUDING REMARKS 

Pfam is a database of protein sequence families. Each 
Pfam family is represented by a statistical model, known 
as a profile hidden Markov model, which is 'trained' using 
a curated alignment of representative sequences. These 
models can be searched against protein sequences in 
order to find occurrences of Pfam families, thereby 
aiding the identification of evolutionarily related (or 
homologous) sequences. As homologous proteins are more 
likely to share structural and functional features, Pfam 
families can aid in the annotation of uncharacterized 
sequences and guide experimental work. Despite the 
continued growth of the sequence databases, Pfam has 
maintained and even increased its coverage of 
UniProtKB. Over the coming years we will continue to 
add new families to Pfam. These, as ever, will come 
from a variety of sources, in particular, the Protein Data 
Bank (PDB) and the analysis of complete proteomes for 
sequences not matched by Pfam. As new data become 
available, we will also revisit existing families, to 
improve their annotation, sequence diversity and domain 
boundaries as necessary. Use of structural information, in 
particular, will help us improve domain definitions and 
increase coverage of UniProtKB at the amino acid level. 
At the same time, we plan to revise clan organization in 
order to further increase representation in dense areas of 
sequence-space. Finally, we hope that the systems that we 
have put in place to allow external contributions, be it via 
Wikipedia or directly into the Pfam database, will engage 
scientists and motivate them to contribute their knowledge 
and experimental results to Pfam, a community resource 
for all. 